Abstract



This study illustrates the use of three least-squares models to control for rater effects in 
performance evaluation: ordinary least squares (OLS); weighted least squares (WLS); and 
ordinary least squares subsequent to applying a logistic transformation to observed ratings 
(LOG-OLS). The three models were applied to ratings obtained from four administrations 
of an oral examination required for certification in a medical specialty. For any single 
administration, there were 40 raters and approximately 115 candidates, and each candidate 
was rated by four raters. $R^2$ values for the OLS and LOG-OLS models were comparable,
while $R^2$ values were substantially higher for the WLS model. The results indicated that raters
exhibited significant amounts of leniency error, and that application of the least-squares 
models would change the pass-fail status of approximately 7% to 9% of the candidates. 
Ratings adjusted by the models demonstrated higher reliability and correlated slightly higher 
than observed ratings with the scores on a written examination.



Least Squares Models to Correct for Rater Effects 
in Performance Assessment 



Oral examinations have long played an important role in evaluating an individual's
readiness for professional practice. In many professions, the oral survives as the final exam
among many assessment hurdles. Of the 23 specialty boards that are members of the
American Board of Medical Specialties (ABMS), 15 include an oral examination as a
requirement for certification (ABMS, 1990). Psychology boards in numerous states include 
the successful completion of an oral examination as part of their licensure requirements 
(Hill, 1984). Oral examinations have achieved prominence in other areas, as well. For 
example, many, if not most, doctoral programs require an oral examination as part of the 
criteria for graduation. Interviews conducted for the purpose of making admission decisions 
may, in many instances, represent nothing more than an oral exam, both in terms of the
methods used to elicit responses, and in terms of the constructs that are evaluated. Given
the recent interest in alternative methods of assessment (Bray & Byham, 1991; Linn, Baker,
& Dunbar, 1991), one might expect an increase in the use of oral examinations. 

Supporters of oral examinations suggest that orals provide a method for evaluating 
psychological constructs that often elude conventional written examinations. The interactive 
nature of an oral can permit an examiner to evaluate a candidate's depth of knowledge in 
a particular area, skill at interpreting and evaluating the utility of diagnostic tests (e.g., 
slides, radiographs, other images), and skill at evaluating and selecting alternative methods 
of managing a particular case (Hill, 1984; Levine & McGuire, 1970; Muzzin & Hart, 1985; 
Watson, 1984). Interpersonal behaviors and communication skills can also be assessed
during an oral examination. In fact, oral examinations in some professions consist of actual
work samples that require examinees to demonstrate proficiency with surgical instruments
or to interact with patients (Levine & McGuire, 1970; Small, 1982). The job-relatedness of
such exams is direct and obvious.

Both supporters and critics of oral examinations acknowledge the limitations: they are 
expensive, subjective, unreliable, and may represent nothing more than a rite of passage
(e.g., Muzzin & Hart, 1985; Watson, 1984). Perhaps the most serious criticism of oral
examinations concerns the levels of reliability that are typically observed. Although
interrater reliabilities have approached .80 in selected studies (O'Donohue & Wergin, 1978),
it is also common to see interrater reliability coefficients in the .20s and .30s (Barnes &
Pressey, 1929; Hubbard, 1971). Consequently, many oral examination programs make use 
of multiple raters in an effort to reduce the influence of measurement error and enhance 
reliability. 

Ratings of performance are susceptible to two general classes of measurement error: 
random and systematic. The random error component is what is typically regarded as rater 
unreliability. If all candidates in a group are evaluated by the same raters, then the 
reliability coefficient can be estimated by: 

$$\rho^2 = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_e^2 / n_r} \qquad (1)$$

where $\rho^2$ is the generalizability (reliability) coefficient, $\sigma_c^2$ refers to the variance
component due to candidates, $\sigma_e^2$ refers to the variance component due to the error (i.e.,
residual variance), and $n_r$ indicates the number of raters evaluating each candidate. These
components of variance can be computed from the mean squares reported for a candidate
by rater ANOVA (Brennan, 1983; Shavelson, Webb, & Rowley, 1989).

Expression (1) applies to complete rating designs. However, most performance ratings 
utilize a design in which each candidate is evaluated by a subset of raters. If an incomplete 
rating design is used, then the reliability estimate must acknowledge the error due to the 
fact that raters may be differentially lenient or harsh in their ratings. This systematic error 
is typically referred to as leniency error. The reliability for many incomplete rating designs
can be computed by:

$$\rho^2 = \frac{\sigma_c^2}{\sigma_c^2 + (\sigma_r^2 + \sigma_e^2) / n_r} \qquad (2)$$

where $\sigma_r^2$ is the variance component due to raters. If all raters are equally lenient (i.e.,
the variance component due to raters is zero), then equations (1) and (2) will provide
equivalent results. This will seldom be the case, however. Given that performance ratings 
are frequently used to make important decisions about an individual's career, any effort to
reduce the impact of measurement error may have social utility. Rater training represents
one common strategy for minimizing rating errors. However, training programs are
time-consuming and costly, and their effectiveness is questionable: although some studies show
positive effects, others show no effect, while still others have demonstrated a negative effect
(e.g., Bernardin, 1978; Bernardin & Pence, 1980; Borman, 1979; Hedge & Kavanaugh, 1988;
King, 1983; Lunz, Wright, & Linacre, 1990; Trier, 1983). A variety of different statistical
models to correct for rating errors have also been proposed in the literature, including
models based on item-response theory (de Gruijter, 1984; Lunz et al., 1990), multivariate
analysis from incomplete data (Houston, Raymond, & Svec, in press), least-squares
regression (Braun, 1988; de Gruijter, 1984; Raymond, Webb, & Houston, 1991; Wilson,
1988), and other models (Cason & Cason, 1984). Prior research suggests that the use of
statistical models can result in considerable reductions in measurement error (Braun, 1988;
Houston et al., in press).

The purpose of this article is to describe and illustrate the use of a simple and flexible 
statistical model that can be applied to incomplete rating designs in order to identify and
correct for leniency error. Actual rating data are used, obtained from four administrations 
of a certification examination in a medical specialty. The magnitude of the adjustments, 
their effect on pass-fail decisions, and their impact on correlations with the scores on a 
written examination are estimated. The next section presents three variations of a least- 
squares model: ordinary least squares (OLS), weighted least squares (WLS), and OLS 
applied to logit-transformed ratings (LOG-OLS). Subsequent sections describe the results 
of applying the models to ratings obtained from four independent administrations of an oral 
examination in a medical specialty. 

Correction Methods 

Ordinary Least Squares (OLS)

Regression-based procedures to identify and correct for rater effects have been proposed 
by de Gruijter (1984) and Wilson (1988). A regression method for analyzing incomplete 
rating data postulates that an observed rating for a candidate is a function of the candidate's
true ability and a leniency or stringency effect associated with the rater providing that
particular rating. The model also assumes an error component. The model can be
represented as follows:

$$y_{ij} = \alpha_i + \beta_j + e_{ij} \qquad (3)$$

where $y_{ij}$ is the rating given to candidate i by rater j,
$\alpha_i$ is the true rating for candidate i,
$\beta_j$ is the bias (i.e., leniency) index for rater j, and
$e_{ij}$ is random error.

The model assumes that the error terms have an expected value of zero and that the
variance of the errors across raters is equal.

Let $a_i$ be an estimator of $\alpha_i$, a candidate's true level of performance. Let $b_j$ be an
estimator of $\beta_j$, the magnitude of leniency or stringency error for rater j. We refer to this
error as bias throughout the paper, consistent with the notion that the error is systematic
as opposed to random. If candidate i is rated by all raters, then any estimator of $\alpha_i$ that
sums or averages the observed ratings is free from rater leniency effects (i.e., is an unbiased
estimator of $\alpha_i$). If, however, candidate i is not rated by all raters, then estimators of $\alpha_i$ will
contain a bias component, unless $\beta_j = 0$ for all j, which is an unlikely circumstance.

The model in expression (3) can be estimated through least-squares regression. Let K 
be the total number of observed ratings assigned by p raters to n candidates. Then the 
matrix formulation for the OLS model is: 



$$y = X \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + e \qquad (4)$$

where y is a (K × 1) vector of observed ratings,
X is a (K × (n + p − 1)) design matrix,
$\alpha$ is an (n × 1) vector of true ratings for candidates,
$\beta$ is a ((p − 1) × 1) vector of rater bias indices, and
e is a (K × 1) vector of random errors.

The design matrix, X, consists of n + p − 1 columns; the column for the last rater is
dropped to avoid a linear dependency in the columns of X. Because X is of full-column
rank, the parameters can be estimated by any standard multiple-regression algorithm. The
appendix presents an example of an incomplete rating matrix for a sample of five candidates
and three raters, as well as the corresponding design matrix. For all candidates and all
raters except for the last rater, the numeral 1 is used to indicate the candidate and rater
with which each observed rating is associated; otherwise a zero is used. The ratings
associated with the last rater are implied by coding the other p − 1 raters with a minus one
(−1). This coding strategy produces the convenient and useful result that the $b_j$ are in
deviation form (i.e., $\sum_{j=1}^{p} b_j = 0$). Parameter estimates are then obtained through ordinary
least-squares regression, where

$$\begin{bmatrix} a \\ b \end{bmatrix} = (X'X)^{-1} X'y .$$

As the vector in X corresponding to the last rater has been dropped, the parameter estimate
for that rater will be missing from the OLS solution. The estimate for that rater is obtained
by taking the negative of the sum of the parameter estimates for the other p − 1 raters,
$b_p = -\sum_{j=1}^{p-1} b_j$.
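
The coding scheme just described can be illustrated with a minimal Python sketch using NumPy. It is offered as an illustration under stated assumptions and is not the authors' program: the function names `build_design` and `fit_ols` are hypothetical, and the ratings are invented, noise-free values for five candidates and three raters (they are not the appendix example).

```python
import numpy as np

def build_design(cand, rater, n, p):
    """Effect-coded design matrix: n candidate columns, p - 1 rater columns."""
    X = np.zeros((len(cand), n + p - 1))
    for k, (i, j) in enumerate(zip(cand, rater)):
        X[k, i] = 1.0              # 1 marks the candidate who received rating k
        if j < p - 1:
            X[k, n + j] = 1.0      # 1 marks the rater who gave rating k
        else:
            X[k, n:] = -1.0        # last rater: -1 in every retained rater column
    return X

def fit_ols(cand, rater, y, n, p):
    """Return adjusted candidate ratings a, rater bias indices b, and X."""
    X = build_design(cand, rater, n, p)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    a = theta[:n]
    # The dropped rater's estimate is minus the sum of the other p - 1 estimates.
    b = np.append(theta[n:], -theta[n:].sum())
    return a, b, X

# Invented incomplete design: each candidate is rated by two of three raters.
cand = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
rater = np.array([0, 1, 1, 2, 0, 2, 0, 1, 1, 2])
true_a = np.array([20.0, 24.0, 18.0, 27.0, 22.0])
true_b = np.array([2.0, -0.5, -1.5])          # in deviation form (sums to zero)
y = true_a[cand] + true_b[rater]              # noise-free ratings for clarity

a, b, X = fit_ols(cand, rater, y, n=5, p=3)
raw_means = np.array([y[cand == i].mean() for i in range(5)])
print("observed means:", raw_means)           # still contain rater bias
print("adjusted ratings:", np.round(a, 2))
print("rater bias:", np.round(b, 2))
```

Because the toy ratings contain no random error, the fit recovers the generating values, whereas the simple candidate means remain contaminated by the leniency of whichever raters happened to be assigned.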

Weighted Least Squares (WLS)

The OLS procedure provides an unbiased estimate of the vector of true ratings. If,
however, the consistency of scoring varies across raters (i.e., if correlations among raters vary
considerably), then the usual regression assumption of equal error variances across all
candidates and raters is violated. The statistical consequence is that the variances of the
parameter estimates will be inflated (Draper & Smith, 1981). The practical consequence
of the inconsistency is that the parameter estimates of candidates who were evaluated by
inconsistent raters will be less accurate than the estimates associated with consistent raters.

Wilson (1988) suggested a two-stage regression procedure consisting of ordinary least 
squares, as described above, followed by weighted least squares. The weights for the second 
stage, which give less influence to inconsistent raters in the determination of the parameter 
estimates, are derived as follows. For each candidate/rater pairing that results in a rating, 
a residual is computed to indicate the accuracy with which that rater's observed rating
corresponds to the rating predicted by the model. If evaluator j provides 10 ratings, the
mean squared residual ($MSR_j$) based on those 10 ratings provides an index of evaluator
consistency. The reciprocals of the mean squared residual ($1/MSR_j$) for all raters can then
be used to derive weights for use in a generalized least squares analysis to obtain revised
estimates of the candidates' true scores. The WLS parameter estimates are given by:

$$\begin{bmatrix} a \\ b \end{bmatrix}_{WLS} = (X'WX)^{-1} X'Wy$$

where the left-hand side stacks the WLS estimates of the candidate and rater parameters,
and W is a K by K diagonal matrix of weights, with the elements of W corresponding to
the value of $1/MSR_j$ for each rater.
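
The two-stage procedure can be sketched as follows. This is a hedged illustration rather than Wilson's (1988) or the authors' implementation: the helper names `design` and `two_stage_wls` are hypothetical, the simulated ratings (10 candidates, 5 raters, one deliberately noisy rater) are invented, and the sketch assumes every rater has a nonzero mean squared residual.

```python
import numpy as np

def design(cand, rater, n, p):
    """Effect-coded design: n candidate columns, p - 1 rater columns."""
    X = np.zeros((len(cand), n + p - 1))
    X[np.arange(len(cand)), cand] = 1.0
    for k, j in enumerate(rater):
        if j < p - 1:
            X[k, n + j] = 1.0
        else:
            X[k, n:] = -1.0        # last rater coded -1 on all rater columns
    return X

def two_stage_wls(cand, rater, y, n, p):
    X = design(cand, rater, n, p)
    # Stage 1: ordinary least squares and its residuals.
    theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ theta_ols
    # Mean squared residual per rater, an index of rater consistency.
    msr = np.array([np.mean(resid[rater == j] ** 2) for j in range(p)])
    # Stage 2: each rating is weighted by 1/MSR of the rater who gave it.
    w = 1.0 / msr[rater]
    XtW = X.T * w                  # equals X'W for diagonal W
    theta_wls = np.linalg.solve(XtW @ X, XtW @ y)
    return X, w, msr, theta_ols, theta_wls

# Invented data: each candidate is seen by three of five raters.
rng = np.random.default_rng(0)
n, p = 10, 5
cand = np.repeat(np.arange(n), 3)
rater = (cand + np.tile(np.arange(3), n)) % p
ability = 22.0 + rng.normal(0.0, 3.0, n)
bias = np.array([2.0, -1.0, 0.0, 1.0, -2.0])
noise_sd = np.array([0.5, 0.5, 3.0, 0.5, 0.5])   # rater 2 is inconsistent
y = ability[cand] + bias[rater] + rng.normal(0.0, 1.0, 3 * n) * noise_sd[rater]

X, w, msr, theta_ols, theta_wls = two_stage_wls(cand, rater, y, n, p)
print("MSR per rater:", np.round(msr, 2))
print("OLS adjusted ratings:", np.round(theta_ols[:n], 2))
print("WLS adjusted ratings:", np.round(theta_wls[:n], 2))
```

Solving the weighted normal equations directly avoids forming the K by K matrix W, which is convenient because W is diagonal.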

OLS Applied to Logit-Transformed Ratings

The presence of floor or ceiling effects in the observed ratings will compress individual 
differences at the two ends of the rating scales. Such effects can be compensated, in part, 
through the use of a nonlinear transformation. The most commonly employed nonlinear 
transformations are the probit and logit transformations (Cohen & Cohen, 1983; Lord, 1980; 
Wright & Stone, 1979). Although the logit and probit transformations achieve similar
outcomes, the distribution is stretched more with the logit transformation (Cohen & Cohen,
1983).

For the present study, we assessed both the rater effect and any floor/ceiling effects by
also applying the OLS model to logit-transformed ratings. The logit transformation was
effected as $0.5 \ln(P/(1-P))$, where P was the ratio of the observed to maximum score possible.
The LOG-OLS model assumes that observed ratings are compressed at the two ends of the
scale, in that the ability-observed rating relationship is not linear but takes the form of an
ogive, and corrects for such compression first, before correcting for rater effects. The
mathematical tractability of the logit transformation (Cohen & Cohen, 1983; Lord, 1980) is
one reason for its popularity over probit models in psychometrics. Also, since the logistic
function approaches its asymptotes less rapidly than the probit model, aberrant ratings do
less harm to the logistic model than to the probit model (Lord, 1980). That is, the logit
transformation is less sensitive to random error than the probit transformation. The impact 
of the transformation can be evaluated by comparing the model fit of the LOG-OLS model 
to the fit of the OLS model with untransformed ratings. 
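
The transformation itself is simple to express in code. The sketch below is an illustration only: the function name `logit_rating` is hypothetical, the default maximum of 36 is taken from the rating scale described in the Method section, and raising an error when P equals 0 or 1 is our assumption, since the text does not say how such ratings were handled.

```python
import math

def logit_rating(rating, max_score=36.0):
    """Return 0.5 * ln(P / (1 - P)), where P = rating / max_score."""
    p = rating / max_score
    if not 0.0 < p < 1.0:
        # Assumption: the transformation is undefined at the scale limits.
        raise ValueError("P must lie strictly between 0 and 1")
    return 0.5 * math.log(p / (1.0 - p))

# A one-point difference near the ceiling is stretched more than one near
# the middle of the scale.
mid_gap = logit_rating(19.0) - logit_rating(18.0)
top_gap = logit_rating(34.0) - logit_rating(33.0)
print(round(mid_gap, 3), round(top_gap, 3))
```

The OLS model would then be fitted to the transformed values in place of the raw ratings.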

Method 

Rating Data 

Data comprised observed ratings from four operational administrations of an
oral certification examination administered by a medical specialty board. All candidates
took a written multiple-choice exam in that specialty prior to participating in the oral
examination. Only those candidates whose written scores fell within the middle range of the
distribution were called to the oral exam. That is, for any given year, approximately 30%
of the candidates who took the written exam also took the oral; those with high written
scores were exempted, while those with low written scores were ineligible.

The oral examination consists of 16 clinical cases, complete with the results of lab tests, 
x-rays, pathology slides, and other diagnostic information. The cases are organized into four
subspecialties. Within each subspecialty, four separate clinical cases are presented to each 
candidate. Although all candidates receive the same cases, candidates are not evaluated by 
the same raters. Specifically, each candidate is evaluated by four raters, and each rater 
presents the same four cases to all candidates evaluated. Raters are nested within topics, 
which are crossed with candidates. Candidates are given ratings on three dimensions 
(factual recall; interpretation of data; clinical problem-solving) by each rater. The three 
ratings are on a Likert-type rating scale ranging from 1 to 12 and serve as indices of
performance in each subspecialty. The sum of the three ratings is obtained as an index of 
overall performance, thus producing a rating scale with a possible range from 3 to 36 for 
each subspecialty (correlations in the .80s among the three dimensions support this simple 
combination of ratings). A candidate's observed final rating is the mean rating obtained 
over the four raters (in four subspecialties) who evaluated that candidate. Therefore, the 
final observed rating scale ranges from a low of 3 to a high of 36 with one-fourth point 
intervals. 

A total of 456 candidates was examined over a four-year period (1987 to 1990). In any
one year, approximately 115 candidates were examined by 40 raters; each candidate was
examined by four raters, and each rater examined from 7 to 14 candidates with an average
of 11.5 candidates. In any given year, the entire data matrix (40 raters by 115 candidates
= 4,600 possible observations) was about 10% complete. Prior to the administration of the
actual examination in any given year, the raters participated in a practice session during
which they rated videotapes of candidates and received feedback regarding the similarity of
their ratings to those of their peers.

Data from the four years were subjected to three variations of least-squares regression:
OLS, WLS, and LOG-OLS. Data resulting from application of the least-squares models 
were subjected to four types of analyses. First, we evaluated the fit of the three models to 
determine if the more complex models (WLS and LOG-OLS) were more accurate than the 
simpler model (OLS). Model fit was evaluated by examining residual plots and computing 
the proportion of variance in observed ratings attributable to the candidate effect and rater
effect ($R^2$). Second, the model-adjusted ratings ($a_i$) were compared to observed ratings and
subjected to various descriptive analyses in order to gain an understanding of the magnitude
of the adjustments. If the values of $a_i$ are essentially identical to the observed ratings, then
there would be little reason to use the models. Third, since the oral examination is used 
to certify physicians in specialty practice, we evaluated the impact of the model-adjusted 
scores on pass/fail decisions. The analyses were conducted to determine the decision 
consistency between model-adjusted ratings and observed ratings. Fourth, we conducted 
analyses to determine if the models altered the intercorrelations among ratings (i.e., 
interrater reliability) and correlations of oral ratings with the scores on a written 
examination. Measurement error attenuates correlations among variables; therefore, we 
expected that the model-adjusted ratings would correlate more highly than observed ratings 
with each other and with scores on a written examination. The data from the four years
were analyzed separately to assess the feasibility of applying the least-squares models in a
real-world setting.

Results 

Table 1 provides the $R^2$ values for the three models for each of the four years analyzed.
Comparisons of the $R^2$ values indicate a better fit for the WLS model for the four data sets.
This result occurs because there are differences in the error variance of ratings among
raters, and this source of variation is modelled by the WLS model. The $R^2$ values in Table 1 also
indicate that the logistic transformation did not result in improved model fit for the four
years considered. In addition, an analysis of the residual plots indicated that a logistic
transformation was unnecessary.



Table 1

$R^2$ for Three Least Squares Models

Year    OLS    WLS    LOG-OLS
1       .54    .57    .53
2       .59    .70    .59
3       .52    .60    .52
4       .54    .65    .56


The differences in fit among the three models can be regarded as evidence that 
differences in rating variability across raters introduce more error than floor or ceiling 
effects for the present data. The lack of floor or ceiling effects could be due to the fact that 
the scale is quite broad, ranging from 3 to 36. Since logit transformation of the observed, 
untransformed ratings resulted in no appreciable increase in model fit for any of the four
years, further discussions will be limited to the OLS and WLS models using the
untransformed ratings only.

The rater effect was found to be statistically significant (p < .01) for all four years, 
indicating the presence of leniency error. The variance components and reliability 
coefficients are presented in Table 2. As Table 2 indicates, the rater effect is
appreciable: about one-half the magnitude of the candidate effect for any year. The
reliability of four ratings ($\rho^2$) ranges from .48 to .56 with an average of .52.



Table 2

Variance Components and Reliability Coefficients

Year    $\sigma_c^2$    $\sigma_r^2$    $\sigma_e^2$    $\rho^2$
1       8.36            3.23            25.01           .54
2       8.70            5.36            22.15           .56
3       7.05            4.13            26.36           .48
4       8.08            4.45            29.92           .49


For the OLS model, estimated $b_j$ values for individual raters ranged from -6.15 to 7.24 for the four
years; the average (over four years) minimum and maximum values were -5.22 and 6.03.
The mean of the absolute values of $b_j$ ranged from 1.87 to 2.41, with an overall mean across
all four years of about 2.11. That is, any rater drawn at random would be expected to be
biased by a little more than two points.

The levels of bias for the WLS model were very similar to those for the OLS model.
However, the WLS model weights each examiner's ratings by the reciprocal of their mean
squared residual ($1/MSR_j$). The values of $MSR_j$ for the OLS model did exhibit considerable
variability, as suggested by the increased model fit for the WLS model. Specifically, the
average (over four years) minimum value of $MSR_j$ was about 4.26, and the maximum value
was 43.12. The mean value of $MSR_j$ over the four years was 17.22. It is important to note
that these values are squared residuals.

Properties of Adjusted Ratings

Descriptive statistics were computed for each of the four years in order to compare the 
distributional properties of the observed ratings and model-adjusted ratings. The
distributional properties of the observed and model-adjusted ratings were very similar for
all years and are provided in Table 3. 

Table 3

Descriptive Statistics for Observed and Model-Adjusted Ratings

Year  Rating Type  N    Mean   S.D.  Min.   Max.
1     Obs.         129  22.65  3.78  13.25  31.25
      OLS               22.66  4.03  13.19  32.57
      WLS               22.65  3.93  12.90  32.20
2     Obs.         114  22.73  3.89  13.25  30.00
      OLS               22.81  3.85  13.19  30.47
      WLS               22.81  3.82  11.92  30.03
3     Obs.         121  22.29  3.81  11.00  31.50
      OLS               22.29  3.73  11.85  29.89
      WLS               22.29  3.73  11.90  30J9
4     Obs.         92   2U0    4.12   9.50  31.50
      OLS               21.13  4.07  10.58  31.52
      WLS               21.14  4.10  10.82  30.77




The correlations among the model-adjusted ratings and the observed ratings ranged 
from .90 to .98. These high correlations indicate that the rank ordering of candidates is not 
affected significantly by the model adjustments. As expected, the correlation between the
WLS-adjusted ratings and observed ratings was lower than the correlation between
OLS-adjusted ratings and the observed ratings. This is consistent with the fact that the WLS
model results in a greater adjustment of ratings by differential weighting of raters.

For norm-referenced selection decisions, the correlations among model-adjusted and 
observed ratings would be of primary interest; that is, one would be concerned primarily 
with rank order information. Within the context of a domain-referenced examination that 
utilizes an absolute cut-off score, rank order information is insufficient for evaluating the 
potential impact of the OLS and WLS models. In such applications, the magnitude of 
adjustments is of critical importance. To examine the magnitude of the adjustments 
imposed by the OLS and WLS models, a difference score was computed for each candidate
by subtracting the model-adjusted ratings (based on both OLS and WLS) from the observed 
ratings. 

As indicated in Table 4, the magnitude of the adjustments exceeded ±3.0 points and 
approached ±5.0 points for year 4. As expected, the magnitude of the adjustments was 
greater for the WLS model than for the OLS model. The average adjustment over the four 
years for the OLS-adjusted ratings was 1.01, while the average adjustment for the WLS 
models was 1.26. One way to gauge the relative magnitude of these adjustments is to
compare them to the standard deviations of the ratings (about 3.90). The magnitude of the
adjustments is about .26 SD units for the OLS model and .32 SD units for the WLS model.
A more conservative index of adjustment can be obtained by using the range of the scale 
as the basis for comparison. Using this index, the magnitude of the typical OLS adjustment 
is about 5.2% of the rating scale, and the typical WLS adjustment is about 6.5% of the 
rating scale. Adjustments of this magnitude may affect the pass/fail decisions of borderline 
candidates. 
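
As a check on the standardized magnitudes, dividing the average absolute adjustments reported above (1.01 for OLS and 1.26 for WLS) by the typical standard deviation of about 3.90 gives

$$\frac{1.01}{3.90} \approx 0.26, \qquad \frac{1.26}{3.90} \approx 0.32 ,$$

that is, roughly one-quarter and one-third of a standard deviation, respectively.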

Table 4

Descriptive Statistics for Differences Between
Model-Adjusted Ratings and Observed Ratings

Year  Model  Min    Max   Mean (Absolute)
1     OLS    -3.06  3.04  0.89
      WLS    -3.27  4.16  Ul
2     OLS    -2.94  4.14  1.11
      WLS    -4.47  4.53  1.16
3     OLS    -3.91  3.40  0.97
      WLS    -3.53  4.89  U2
4     OLS    -3.25  4.09  1.08
      WLS    -4.62  4.91  1.47




Impact on Pass/Fail Decisions

The consistency of pass/fail decisions for observed ratings and model-adjusted ratings 
was evaluated. For each year, several artificial but realistic cut-off points were imposed. 
The pass points were selected so that the pass rates would range from 70% to 90%, which are
typical pass rates for a certification exam in the health professions. Multiple cut-off points were
used so that the impact of model adjustments on the pass/fail decisions could be examined 
over a wider range of the distribution of ratings. This will help ensure that conclusions 
about the stability of the impact of adjustments on pass/fail decisions are not due to local 
abnormalities in the distribution. The effects of the adjustments on pass/fail decisions are 
summarized in Table 5. 

The pass rates for each cut-off score resulting from the use of adjusted ratings are 
presented, along with the pass rates resulting from the use of observed ratings. The pass
rates were similar for all three types of ratings. This result is consistent with the fact that
the least-squares models did not significantly alter the overall distribution of ratings. Also
presented in Table 5 are the rates of disagreement for the pass/fail decisions produced by
the OLS and WLS models, as compared with pass/fail decisions for the observed ratings.
The disagreement rates range from 2.5% to 14.1%, and are consistently higher for the WLS
model. The average rate of disagreement is 6.8% for the OLS model and 8.5% for the
WLS model. Further analysis also revealed that the percentage of decisions changing from
pass to fail was about the same as the percentage of decisions changing from fail to pass. 

It can be seen that the disagreement rates increase as the passing point approaches the
mean. Deviations from this trend reflect local characteristics of the rating distribution. The 
trend for disagreement rates to increase as the passing point approaches the mean is due 
to the fact that the opportunity for chance agreement decreases toward the midpoint of the 
distribution. Although the Kappa coefficient (Cohen, 1960) corrects for chance agreement, 
and would eliminate this trend, its use in the present context seemed to be more of a 
hindrance than an asset.
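
For readers who wish to compute both indices, the following sketch contrasts the raw disagreement rate with Cohen's (1960) kappa for a single cut-off. It is an illustration only: the function name, the ten observed and adjusted ratings, and the cut-off of 20.0 are all invented and do not come from the examination data.

```python
def disagreement_and_kappa(pass_a, pass_b):
    """Raw disagreement rate and Cohen's kappa for two sets of pass/fail decisions."""
    n = len(pass_a)
    agree = sum(x == y for x, y in zip(pass_a, pass_b)) / n
    rate_a = sum(pass_a) / n
    rate_b = sum(pass_b) / n
    # Agreement expected by chance from the two marginal pass rates.
    chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    kappa = (agree - chance) / (1 - chance)   # undefined if chance equals 1
    return 1 - agree, kappa

# Invented ratings for ten candidates and a hypothetical cut-off score.
observed = [23.0, 19.5, 25.25, 18.0, 21.0, 27.5, 20.25, 22.0, 16.5, 24.0]
adjusted = [22.1, 20.4, 25.9, 18.6, 19.7, 27.0, 21.2, 22.4, 17.3, 23.1]
cut = 20.0
obs_pass = [r >= cut for r in observed]
adj_pass = [r >= cut for r in adjusted]

disagree, kappa = disagreement_and_kappa(obs_pass, adj_pass)
print(round(disagree, 2), round(kappa, 3))
```

In this toy example one candidate moves from fail to pass and one from pass to fail, so the pass rate is unchanged even though two of the ten decisions differ.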

Improvements in Validity

As the purpose of the least-squares adjustments is to reduce measurement error, it 
seemed important to empirically determine the extent to which rater reliability is actually 
improved. This was done by obtaining the variance components of the model-adjusted 
ratings and computing the reliability coefficients. Stanley (1961) has noted that rater 
agreement can be interpreted as evidence supporting the construct validity of ratings. The 
variance component analyses produced two noteworthy findings. First, the variance
component due to raters for the model-adjusted ratings went to zero. This result was
expected since the least-squares models statistically remove the systematic variation among
raters. Second, the reliability increased from an average of .52 (Table 2) to an average of
.63 for the four years.

Table 5

Consistency of Pass/Fail Decisions Based on Observed and
Adjusted Ratings for Selected Points

The validity of the model-adjusted ratings was further examined by computing the 
correlations between the ratings (observed and model-adjusted) and scores obtained on a
written certification examination. As noted earlier, only 30% of all candidates who took the 
written exam also qualified for the oral examination: high scoring candidates are exempted, 
and low scoring candidates are disqualified. As this rather severe range restriction 
substantially depresses the observed correlations, a correction for range restriction was 
applied. 
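
The text does not state which correction for range restriction was used. One widely used option for direct selection on the predictor is Thorndike's Case 2 formula, sketched here purely as an assumption; the function name and the SD ratio of 2.0 in the example are hypothetical and are not taken from the study.

```python
import math

def correct_range_restriction(r, sd_ratio):
    """Thorndike Case 2 correction (assumed, not confirmed by the source).

    r        : correlation observed in the restricted group
    sd_ratio : unrestricted SD divided by restricted SD of the written score
    """
    return r * sd_ratio / math.sqrt(1.0 - r ** 2 + (r ** 2) * sd_ratio ** 2)

# With no restriction (ratio of 1) the correlation is unchanged; a larger
# ratio produces a larger corrected value.
unchanged = correct_range_restriction(0.34, 1.0)
corrected = correct_range_restriction(0.34, 2.0)
print(round(unchanged, 2), round(corrected, 2))
```

Because the correction is a monotonic function of the observed correlation for a fixed SD ratio, it preserves the ordering of the observed, OLS, and WLS correlations within a year.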



Table 6

Correlations of OLS, WLS, and Observed Ratings
With Written Test Scores¹

Year  Observed   OLS        WLS
1     .34 (.70)  .33 (.69)  .33 (.69)
2     M (.76)    .41 (.81)  .39 (.79)
3     .17 (.45)  .19 (.49)  .20 (.51)
4     .14 (.38)  .21 (.53)  .23 (.56)

¹ The values within parentheses are correlations corrected for range restriction.

Table 6 presents the uncorrected and corrected correlations between written scores and
the three sets of ratings. Overall, the model-adjusted ratings correlate more highly with the
scores in the written examination than the observed ratings, although the magnitude of the
improvement is modest. The average increase in correlations over the four years is .06 for
the OLS model and .07 for the WLS model. The results indicate that the model-adjusted
ratings are more reliable than observed ratings, and that the increased reliability enhances
their relationship with an external criterion.

Discussion 

This investigation illustrates the feasibility of applying three variations of least-squares
regression to correct for rater errors in performance evaluations. Clearly, the efficacy of the
regression methods is a function of the degree to which the models fit the rating data, as
well as the magnitude of the rater effect (i.e., leniency error). The rater effect was
statistically significant for each of the four years. Although the values of bj reached nearly
±5 points, the average level of bias was a little over ±2 points. R² values for the OLS
model ranged from .52 to .59 over the four years, while the R² values for the WLS model
ranged from .57 to .70. The use of the logit transformation was not necessary
for the present data, as suggested by the R² values for the LOG-OLS models. An analysis of the
residual plots confirmed the suitability of the linear model. It is likely that the original
rating scales (a series of three 12-point Likert scales) were sufficiently broad to deter floor
and ceiling effects. It is certainly possible that the use of narrower scales would have
produced detectable floor and ceiling effects.
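
As a rough illustration of the kind of transformation on which a LOG-OLS model relies, the
following sketch maps a bounded rating onto the logit scale and back. The rescaling of the
rating to a proportion, the half-point padding (eps), the 1-to-12 scale used in the example,
and the function names are all assumptions made for illustration; they are not taken from
the procedure applied in this study.

```python
import math

def rating_to_logit(rating, low, high, eps=0.5):
    """Map a rating on [low, high] to the logit scale.

    eps pads the scale so that ratings at the floor or ceiling do not
    produce infinite logits (an assumed convention, not the study's).
    """
    p = (rating - low + eps) / (high - low + 2 * eps)
    return math.log(p / (1 - p))

def logit_to_rating(z, low, high, eps=0.5):
    """Inverse of rating_to_logit: map a logit back to the rating scale."""
    p = 1 / (1 + math.exp(-z))
    return p * (high - low + 2 * eps) + low - eps

# One-point steps on the rating scale are stretched near the floor and
# ceiling and compressed in the middle of the scale.
logits = [rating_to_logit(r, 1, 12) for r in (1, 2, 6, 7, 11, 12)]
```

When ratings rarely approach the ends of the scale, as appears to be the case here, the
transformation is nearly linear over the observed range and would be expected to change
the fit very little.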

For the present sets of data, correcting for rater effects made a modest difference in the
overall rating received by the typical candidate. The OLS model resulted in about a ±1-point
adjustment on average, although the magnitude of the adjustments for several
candidates fell in the ±3.0-point range. Adjustments based on the WLS model were about
25% larger, with the average adjustment a little more than 1¼ points and several
adjustments falling in the 3.0- to 4.0-point range. For both models, a few adjustments
exceeded 4.0 points. In the present rating situation, the assignment of four examiners to
each candidate helped minimize the negative impact due to the bias of any single rater.
That is, lenient and harsh raters tended to cancel one another to some extent. Since the
index of rater bias, bj, was more than two points on average, it is clear that the magnitude
of the adjustments imposed by the OLS and WLS models would certainly have been larger
had only two or three raters been assigned to each candidate.
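
The effect of panel size can be made concrete with a short calculation. Assume, purely for
illustration, that the k raters assigned to a candidate are drawn at random and that their
bias terms bj are independent with mean zero and variance sigma_b squared (neither
assumption is tested in this study). The net bias in the candidate's mean rating is then:

```latex
\bar{b} = \frac{1}{k}\sum_{j=1}^{k} b_j ,
\qquad
\operatorname{Var}(\bar{b}) = \frac{\sigma_b^{2}}{k} ,
\qquad
\operatorname{SD}(\bar{b}) = \frac{\sigma_b}{\sqrt{k}} .
```

Under these assumptions, moving from four raters to three multiplies the typical net bias
by about 1.15 (the square root of 4/3), and moving from four raters to two multiplies it by
about 1.41 (the square root of 2).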

As indicated in Table 5, the modest levels of leniency error altered the pass/fail 
decisions for selected candidates. The results indicated that using OLS-adjusted ratings
would result in changed decisions for approximately 7% of the candidates in any single year. 
Use of the WLS model would alter the decisions for 8% to 9% of the candidates. Is this 
an important effect? In the context of the present examination, it is. The candidates were 
physicians seeking board certification in a specialty area. Nearly all of the candidates had 
completed four years of college, three to four years of medical school, and at least another 
three years of residency training in this particular medical specialty. Even a few erroneous 
pass/fail decisions can have significant personal, social, and economic consequences. 

Since the data were not perfectly modelled, it is difficult to know whether all of the
adjustments to the ratings were appropriate. That is, the least-squares models could have
resulted in a candidate's status being incorrectly changed from pass to fail or vice versa. In
addition, there are likely to be some candidates whose pass/fail status should have changed
but did not. However, the improvements in reliability for all years, as well as the increases
in the correlations with written scores for three of the four years, suggest that the
adjustments to the ratings imposed by the least-squares models were, in fact, in the
appropriate direction for most candidates. That is, the model-based adjustments appear to
improve the construct validity of the ratings.

The utility of the least-squares models is a direct function of model fit. In particular,
the adjustments to the ratings will be larger and more beneficial as the variance component
due to error decreases and the variance component due to the rater effect increases.
Although the R² values in the present study are lower than those observed for essay
ratings (Braun, 1988), it is generally the case that essays can be rated with far more
reliability than performance in an oral examination. The R² values in the present data are
also lower than those reported by Cason and Cason (1984), who applied a rather complex,
iterative, least-squares model to probit-transformed ratings. In short, the degree of model
fit for the present data should be regarded with cautious optimism.

Empirical research on model fit is lacking. Consequently, it is difficult to
unambivalently advocate the use of the least-squares models on an operational basis. In the
presence of borderline model fit, one is faced with a difficult decision. On one hand,
less-than-desirable model fit may discourage the use of model-adjusted ratings. On the other
hand, poor model fit suggests that even the observed ratings are not reliable enough for
making important decisions. The choice is not an easy one; more research is needed to
establish guidelines. The results of one empirical investigation using numerous sets of
simulated rating data clearly supported the use of three models: OLS, WLS, and imputation of
missing ratings via the E-M algorithm (Houston et al., in press). However, the simulated
data contained less error variance than the present data, and the rating designs were more
complete; therefore, the results of that study do not necessarily generalize to the data
investigated in the present study. Within the context of essay ratings, Braun (1988)
demonstrated that, under certain circumstances, the use of an OLS model could actually
result in greater increments in reliability than could be obtained by doubling the number
of raters. The least-squares models appear to offer a viable method of reducing the
negative consequences of rating errors. Future simulation studies will need to investigate
the effectiveness of the least-squares models using data that are consistent with the levels
of reliability typically observed in oral examinations (Muzzin & Hart, 1985) and in work
settings (Rothstein, 1990).

In addition to detecting and correcting for rater effects, the least-squares models can
generate statistical information useful for describing the psychometric properties of the
rating data. For example, in the present study the index MSRj was used for computing
weights for the WLS model. Since MSRj is inversely related to the correlation of rater j
with all other raters, it can be interpreted as an index of rater reliability, rater
discrimination, or rater fit. It is parenthetically noted that the OLS and WLS models can
be loosely interpreted within an item-response theory framework, whereby the difficulty
parameter of a rater is described by bj, and the slope is a function of MSRj. The presence
or absence of the logit transformation will determine whether the model is linear or
conforms to the logistic function.
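
The two-step logic described above (fit by OLS, summarize each rater's residuals, then
refit with weights) can be sketched as follows. This is a minimal illustration under
stated assumptions rather than the procedure used in the study: the additive model
y = a[candidate] + b[rater] + error, the definition of MSRj as the simple mean of a rater's
squared OLS residuals, the use of 1/MSRj as the weight, the simulated data, and all
function and variable names are hypothetical.

```python
import numpy as np

def fit_additive(cand, rater, y, weights=None):
    """Fit y = a[cand] + b[rater] + error by (weighted) least squares.

    Rater effects b are centered to sum to zero, so a[i] estimates the
    rating candidate i would receive from a rater of average leniency.
    """
    cand, rater = np.asarray(cand), np.asarray(rater)
    y = np.asarray(y, dtype=float)
    n_c, n_r = cand.max() + 1, rater.max() + 1
    rows = np.arange(len(y))
    X = np.zeros((len(y), n_c + n_r))
    X[rows, cand] = 1.0          # candidate indicator columns
    X[rows, n_c + rater] = 1.0   # rater indicator columns
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    sw = np.sqrt(w)
    # The system is rank deficient (a and b are defined only up to a shift);
    # lstsq returns the minimum-norm solution, and fitted values are unique.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    a, b = beta[:n_c], beta[n_c:]
    resid = y - X @ beta
    return a + b.mean(), b - b.mean(), resid

def rater_msr(rater, resid):
    """Mean squared residual for each rater (assumed form of MSRj)."""
    rater = np.asarray(rater)
    return np.array([np.mean(resid[rater == j] ** 2)
                     for j in range(rater.max() + 1)])

# Simulated (hypothetical) data: 20 candidates, 4 raters, 3 raters per candidate.
rng = np.random.default_rng(0)
true_a = rng.normal(24.0, 3.0, size=20)
true_b = np.array([2.0, -1.0, 0.5, -1.5])
noise_sd = np.array([0.5, 0.5, 0.5, 2.0])   # the last rater is least consistent
cand = np.repeat(np.arange(20), 3)
rater = np.array([j for i in range(20) for j in range(4) if j != i % 4])
y = true_a[cand] + true_b[rater] + rng.normal(0.0, noise_sd[rater])

a_ols, b_ols, resid = fit_additive(cand, rater, y)         # OLS step
msr = rater_msr(rater, resid)                               # one value per rater
w = 1.0 / msr[rater]                                        # weight for each rating
a_wls, b_wls, _ = fit_additive(cand, rater, y, weights=w)   # WLS step
```

In this sketch, ratings from raters with large MSRj values receive less weight in the
second fit, which is the sense in which MSRj behaves like a discrimination or fit index.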

The least-squares regression models provide the many useful inferential statistics (e.g.,
tests that bj = 0 for each j) and diagnostic procedures (e.g., residual plots) that are
described in selected texts on linear models (e.g., Belsley, Kuh, & Welsch, 1980; Draper &
Smith, 1981). Residuals can also be cumulated for each candidate, and an index MSRi can
be used to describe the degree to which each candidate fits the underlying rating model.
A large value of MSRi would indicate that candidate i has not, for some reason, been
measured as well (or on the same constructs) as other examinees. In item-response theory,
conceptually similar indices are interpreted as measures of "appropriateness measurement"
(Hulin, Drasgow, & Parsons, 1983; Levine & Rubin, 1979) or "person fit" (Wright & Stone,
1979). Values of MSRi, when computed within selected demographic groups (e.g., males,
females), can also be used to study issues related to fairness and bias. The least-squares
models also provide standard errors for each candidate and each rater. For example, the
standard errors associated with each candidate's adjusted rating for the OLS model exhibit
minimal variability, whereas the standard errors based on the WLS model exhibit
considerable variability, as they are sensitive to the differential levels of error variance
associated with the individual raters who assigned the ratings.

Although the present study addressed rating errors within the context of oral 
examinations, the least-squares models can be applied in many other rating circumstances. 
The rating design must be structured so that each candidate is evaluated by more than a 
single rater and each rater evaluates more than a single candidate. Also, candidates and 
raters should be crossed (incompletely); cohorts of students cannot be nested within fixed 
teams of raters. That is, a certain degree of overlap must exist in the rating data in such 
a way that each rater can be directly or indirectly linked to each of the other raters (de 
Gruijter, 1984). On the surface, the incidence of rating designs that meet these 
requirements may seem quite low. To the contrary, such designs often naturally occur, or 
could often occur in many practical contexts such as: interviews of job applicants, ratings 
given at assessment centers, reviews of manuscripts or proposals, evaluation of faculty by 
review committees, and institutional accreditation (hospitals, universities), to name a few.
Grades in college courses also represent an instance of an incomplete rating design. 
Previous investigations have used either a simple additive model (Elliott & Strenta, 1988) 
or an IRT graded-response model (Young, 1990; 1991) to adjust college grades for leniency 
error. The OLS models presented in this paper appear to offer another alternative for 
adjusting college grades.
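
The linking requirement on the rating design can be checked mechanically before any model
is fitted. The sketch below treats two raters as directly linked when they rate a common
candidate and then tests whether every rater can be reached from every other. The function
name, the input format, and the two example designs are hypothetical.

```python
from collections import defaultdict

def raters_are_linked(pairs):
    """Return True if every rater is directly or indirectly linked to every other.

    pairs: iterable of (candidate, rater) tuples, one per observed rating.
    Two raters are directly linked when they rate a common candidate.
    """
    by_cand = defaultdict(set)
    by_rater = defaultdict(set)
    for c, r in pairs:
        by_cand[c].add(r)
        by_rater[r].add(c)
    if not by_rater:
        return True
    start = next(iter(by_rater))
    seen, stack = {start}, [start]
    while stack:                       # depth-first search over raters
        r = stack.pop()
        for c in by_rater[r]:
            for other in by_cand[c]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return len(seen) == len(by_rater)

# Incompletely crossed design: raters A, B, and C overlap through shared candidates.
crossed = [(1, "A"), (1, "B"), (2, "B"), (2, "C"), (3, "A"), (3, "C")]
# Nested design: team {A, B} and team {C, D} never rate a common candidate.
nested = [(1, "A"), (1, "B"), (2, "A"), (2, "B"),
          (3, "C"), (3, "D"), (4, "C"), (4, "D")]
```

A design such as the nested one fails the check: leniency differences between the two
teams cannot be separated from ability differences between their cohorts of candidates.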

Controlling rater error (leniency, discrimination) has heretofore been left to the good 
intentions, but limited effects, of rater training programs (e.g., Bernardin & Pence, 1980;
THer, 1983). It is possible that statistical methods to control for this type of rater error may
prove to be a useful and inexpensive adjunct to rater training. We hope that future research
will address this possibility.
Appendix



The following tables present a sample rating matrix suitable for least-squares regression and
the corresponding design matrix. Ratings on a seven-point scale are provided in the cells.

Incomplete Rating Design Suitable for Methods to
Correct for Rater Effects



Rater 

Candidate A B C 

1 3 2 

2 3 3 

3 5 4 

4 5 4 
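
For readers who wish to build a design matrix of this kind, the following minimal sketch
does so in Python with NumPy. It assumes indicator (dummy) coding with one column per
candidate and one per rater, plus a sum-to-zero constraint on the rater effects; the coding
used for the design matrix reported in this appendix may differ. The assignment of each
rating to rater A, B, or C in the sketch is hypothetical and was chosen only so that the
three raters are linked.

```python
import numpy as np

# Ratings as in the table above. Which rater gave each rating is a
# hypothetical assignment made for illustration only.
ratings = [  # (candidate, rater, rating)
    (1, "A", 3), (1, "B", 2),
    (2, "B", 3), (2, "C", 3),
    (3, "A", 5), (3, "C", 4),
    (4, "A", 5), (4, "B", 4),
]
candidates = sorted({c for c, _, _ in ratings})
raters = sorted({r for _, r, _ in ratings})

# One row per rating; candidate indicator columns first, then rater columns.
X = np.zeros((len(ratings), len(candidates) + len(raters)))
y = np.zeros(len(ratings))
for row, (c, r, score) in enumerate(ratings):
    X[row, candidates.index(c)] = 1.0
    X[row, len(candidates) + raters.index(r)] = 1.0
    y[row] = score

# Extra row imposing sum(b) = 0 so that the least-squares solution is unique.
constraint = np.r_[np.zeros(len(candidates)), np.ones(len(raters))]
X_aug = np.vstack([X, constraint])
y_aug = np.append(y, 0.0)

beta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
a_hat = beta[:len(candidates)]   # adjusted ratings (candidate effects)
b_hat = beta[len(candidates):]   # rater leniency (+) or harshness (-) estimates
```

Each row of X contains exactly two ones, one marking the candidate and one marking the
rater, which is what makes the model additive in candidate and rater effects.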



Design Matrix for OLS Method Based on Data in Table Above